On a chain of fragmentation equations for duplication-mutation dynamics in DNA sequences

Recent studies have revealed that for the majority of species the length distributions of duplicated
sequences in natural DNA follow a power-law tail. We study duplication-mutation models for
processes in natural DNA sequences and the length distributions of exact matches computed from
both synthetic and natural sequences. Here we present a hierarchy of equations for various numbers
of exact matches in these models. We reduce this hierarchy to a single equation for pairs of
exact repeats and demonstrate quantitative correspondence between its solutions and simulations.



INTRODUCTION 

In recent years a series of duplication-mutation models
related to processes occurring in natural DNA sequences
has been reported [1-3]. The motivation for introducing
these models was earlier empirical observations on
length distributions [4] of identical repeats in natural
DNA sequences [5]. In particular, it was observed that when
the length distributions are computed within single chromosomes
or whole genome sequences, they tend to exhibit
power-law tails with an exponent close to $-3$ [10].
These observations naturally drew attention
to potential mechanisms accounting for them.

The first step towards explaining these distributions
was made in [1], where empirical computational models
of chromosome evolution based on a duplication mechanism
were suggested. Duplications in these models
were thought of as random events of copying and pasting
a part of the chromosome. If we copy a part and
substitute it at another place of the chromosome,
each such event typically results in a pair of identical
sequences. These sequences are then progressively destroyed
by new duplication events and eventually disappear,
but since the model generates new pairs at each time
unit, some balance in the number of duplicates might be
expected. It was demonstrated that this evolutionary
model with random duplications generates length distributions
of exact matches, or maxmers [11], with power-law
tails. It was also demonstrated that tails with the
exponent $-3$ can be obtained in the model
by varying a parameter responsible for the length of the
sequences that are copy-pasted at each time step. This random
mechanism producing new pairs of exact matches is
further referred to as the source of duplications; it is
characterized by several parameters, e.g., by the length of the
region for copying and pasting, which is chosen in accordance
with some probability distribution. Thus, this model
indicated a neutral mechanism that generates algebraic
tails in the length distributions of exact matches and
provided a first qualitative explanation of the corresponding
observations in natural genomes.

Models less dependent on the source of duplications
but incorporating additional mechanisms for generating
heavy algebraic tails in length distributions of
exact matches were presented in [2, 3]. Unlike [1],
the two basic mechanisms utilized in these models, duplication
as in [1] and point mutation, reflect those in natural
chromosomes. It was demonstrated that the length
distributions [4] of repetitive sequences simulated by the
models correspond to those observed in natural chromosomes
and that the form of those distributions was also
close to algebraic, with exponents typically around $-3$.
Thus the models in question were able to reproduce these
exponents, and even the amplitudes of the distributions
were fitted [3]; but unlike [1], the structure of the duplication
source did not influence the exponent $-3$ of the length
distributions in a certain parameter regime.

An important feature of these models was the definition
of pairs of exact repeats. In [3] the authors used
supermaximal repeats as the basic type of exact match.
Supermaximal repeats are described in [12]; they represent
a subset of exact matches with additional conditions
of maximality at the ends. On the other hand,
the work [5] relies on the definition of exact repeats as
they are computed by mummer but also applies additional
post-processing that, in our view, imitates the definition of
supermaximal repeats [5]. Nevertheless, the distinctive
feature observed for the length distributions in [5] was
the algebraic behavior of the tails for a broad range of
parameters, while [3] demonstrated that when mutations
occur as often as duplications (simplistically speaking),
the algebraic behavior disappears; this point is
discussed in more detail in [3]. This observation
indicates that the definition of exact repeats influences the
output length distributions.

Thus, the duplication-mutation model is in fact determined
by two components: a) the evolutionary mechanisms
applied to the synthetic chromosome, in our case
duplications and point substitutions, and b) the definition of
how to compute the length distributions, i.e., de facto,
how we count exact matches.

FIG. 1: Random duplications as they appear in the synthetic
sequence. A random sequence of fixed length D (red bar) is
chosen from the chromosome (blue bar) and copied into another
randomly chosen place of the chromosome, thus producing a pair
of exact matches. Simultaneously, point substitutions are applied
to the whole chromosome with some rate. Length distributions of
such pairs (with the restrictions imposed by mummer) are computed
and analyzed throughout the paper.

In this paper we 1) rely on mummer in our computation
of the exact repeats following [5] but do not apply additional
postprocessing to portray supermaximal repeats;
thus, our counting is different from that of both [5] and [3]; 2)
suggest dynamic equations reproducing both the exponent
and the amplitude of the length distribution for that
counting; 3) demonstrate that the stationary equation
that we derive, which reproduces the amplitude and the
exponent for length distributions of pairs of exact repeats,
can be represented as an (infinite) sum, or chain, of
equations for different types of exact repeats; 4) demonstrate
that the equation for supermaximal repeats from [3] is
incorporated in the chain of equations we introduce for
various types of exact matches.


MODEL 

The evolutionary mechanisms used in numerical simulations
of the model correspond to those of the earlier models; a detailed
explanation of these duplication-mutation models can be
found, e.g., in [3], but we summarize them in this section.

The layout of the model is shown in Fig. 1. We consider
a synthetic chromosome (blue bar in Fig. 1) represented
as a string of L bases chosen from a finite alphabet; in
natural genomes the alphabet consists of four bases A, G,
C, and T. The distance between bases is a length scale
denoted by $a$; for natural genomes it is close to 1 Å.

Within our models a subsequence of length D (red bar
in Fig. 1) is chosen randomly within the chromosome
and is substituted for a sequence of length D at another
randomly chosen position in the chromosome (Fig. 1).
These duplications are assumed to occur with the rate $\lambda$
measured per time unit, per base. Simultaneously, point
substitutions are applied to the system with the rate $\mu$
per time unit, per base.
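
The two mechanisms described above can be illustrated with a minimal simulation sketch. It is not the code used for the simulations reported here: the function names, the default parameter values, and the discretization (a fixed number of duplications per step, `dups_per_step`, and a per-base substitution probability per step, `p_mut`, standing in for the rates $\lambda$ and $\mu$) are all assumptions made for illustration.

```python
import random

def duplicate(chrom, D, rng):
    """Copy a random segment of length D over another random position (in place)."""
    L = len(chrom)
    src = rng.randrange(L - D + 1)
    dst = rng.randrange(L - D + 1)
    segment = chrom[src:src + D]   # copy first, so overlapping src/dst is safe
    chrom[dst:dst + D] = segment   # same length in and out: L is preserved

def mutate(chrom, p_mut, rng, alphabet="ACGT"):
    """Replace each base, independently with probability p_mut, by a different base."""
    for i, base in enumerate(chrom):
        if rng.random() < p_mut:
            chrom[i] = rng.choice([b for b in alphabet if b != base])

def simulate(L=1000, D=50, dups_per_step=1, p_mut=1e-3, steps=100, seed=0):
    """Hypothetical discretization: per step, dups_per_step duplications, then mutations."""
    rng = random.Random(seed)
    chrom = [rng.choice("ACGT") for _ in range(L)]
    for _ in range(steps):
        for _ in range(dups_per_step):
            duplicate(chrom, D, rng)
        mutate(chrom, p_mut, rng)
    return "".join(chrom)
```

The resulting string could then be written to a FASTA file and compared against itself with a repeat finder to build the length distribution of exact matches.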

The sequence feature that we study is the set of repeated
sequences within the chromosome. To find all
pairs of exact matches in the synthetic sequence we apply
mummer. Mummer searches for maximal repeats, or
maxmers [11], which are akin to the supermaxmers [15] mentioned
in the previous section and used in [3] in the sense
that the computation of both sets is based on some maximality
condition. However, the set of exact matches computed
by mummer is larger than the set of supermaxmers
of the same length, as the definition of the latter includes
additional restrictions. Observations show that
the output of these computations is noticeably different
if we compare the length distributions obtained in the
models [2] and [3]. Our aim here is a model capable
of reproducing the simulated length distributions obtained
with mummer without any additional restriction,
as well as an equation for the simulated length distributions.
In the discussion below it is always implied that
mummer is used with the option -maxmatch, which according
to the mummer manual computes
exact matches 'regardless of their uniqueness' [14]. The
Appendix also contains more rigorous definitions
of various types of repeats. However, for the purposes of
the analytic derivation suggested below it is sufficient to
think that the equations aim to reproduce the length distributions
constructed for the set of repeats obtained by
mummer, a standard tool in comparative analysis of long
DNA.


ANALYTIC TREATMENT 


Let the number of pairs of duplicates of length $m$
at time $t$ be $g_2(t,m)$. We assume that new duplication
events occur with the rate $\lambda$ per base, per time
unit; at the same time the chromosome undergoes point
mutation events occurring with the rate $\mu$ per base, per
time unit. We first write down the evolutionary (balance)
equation for the average number of pairs of duplicates $g_2$,
which was derived in [3]; it has the form

$$\ldots\ \sum_{k>m} g_2(t,k) + L \ldots \delta_c(D-m). \qquad (1)$$

The main difference between this equation and the equation
of [5] is notation (we use $g_2$ here instead of $f$). In
addition, there is no prefactor 2 in the last term of the
equation because in [3] we studied the number of duplicated
sequences, while here we look at the number of pairs
of duplicates; thus, the source produces one pair of duplicates
at each time step. We also confine ourselves to
the equation for the monoscale source using the Kronecker
delta function $\delta_c(D-m)$; different source terms are also
possible and will be presented elsewhere. Thus equation (1)
is provided for reference and for connection to
the subsequent discussion.

We will then focus on the stationary version of the
equation, implying that $g_2(t,m) \to g_2(m)$ as $t \to \infty$
(this can be demonstrated by analytic calculation)

Now, in the same way as we looked at pairs of identical
duplicates, we can look at triplets, quadruplets, etc.
of identical sequences and write down the corresponding
equations for them. For $i$-plets we have the following
stationary equation

$$\begin{aligned}
&\ldots \\
&+ (i-1)(D-m+a)\,g_{i-1}(m) + 2(i-1)\,a \sum_{k>m} g_{i-1}(k), \qquad i > 2. \qquad (3)
\end{aligned}$$

We see that, unlike the equation for duplicates, which contains
the source term with the delta function, the other equations
also have sources of new $i$-plets; these sources are
$(i-1)$-plets and are expressed by the last two terms in (3). One
produces $i$-plicates from $(i-1)$-plicates of the same length $m$
(the first term in the second line of (3)); the other generates
$i$-plicates from longer $(i-1)$-plicates by copying and
pasting their parts of length $m$ (the second term in
the second line of (3)). That is, new duplicates $g_2(m)$
generated by the source in turn produce triplicates $g_3(\tilde m)$,
where $\tilde m \le m$, triplicates produce quadruplicates $g_4$, etc.
The first term in the first line of (3) is responsible for the
destruction of sequences by new duplications and point
mutations; the coefficients represent the corresponding rates.
The second term in the first line of (3) shows that longer
sequences are turned into shorter ones, again by duplications
and point mutations. The general mechanism has
much in common with models studied in fragmentation
theory [13]. This similarity is also discussed below.

Thus for each $m = 1, \ldots, D$ we have a set of equations
for various sets of identical repeats (maxmers). As
was demonstrated in [3], the equation for $g_2$ fits well
the length distribution of supermaxmers computed for
the synthetic chromosome after applying the evolutionary
duplication-mutation dynamics described above. Equations
for different types of repeats, to our knowledge,
have not been obtained earlier. We refer to this set of equations
as a chain because, as is easily seen, the functions $g_i$
represented in the $i$-th equation are related to the "adjacent"
functions $g_{i-1}$ and $g_{i+1}$.

Using these equations we can obtain the equation corresponding
to the length distributions of exact matches
computed by mummer as follows. We sum up all the equations
for $i = 1, 2, \ldots$ and find a new equation for the
function $G(m) = \sum_i i\,g_i(m)$; the equation has the form

$$-(\zeta+2)\,m\,G(m) + 2a\,G(m) + 2(\zeta+2)\,a \sum_{n>m} G(n) + L\,\delta_c(D-m) = 0, \qquad (4)$$

where $\zeta = D\mu/(a\lambda)$ is a dimensionless parameter.
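
The summation step can be made explicit with the following sketch. The form of the stationary chain written below is an assumption: it is chosen so that its source terms match the verbal description of the two source terms of eq. (3) and so that the sum reproduces eq. (4). It is not a quotation of eqs. (2) and (3), and the prefactors of the loss terms in particular are inferred, not taken from the original.

```latex
% Assumed stationary chain (illustrative; the loss-term prefactors are inferred):
\begin{align*}
0 &= -2\big[(\zeta+1)m + D - a\big]\,g_2(m) + 4(\zeta+1)\,a\sum_{k>m} g_2(k)
      + L\,\delta_c(D-m), \\
0 &= -i\big[(\zeta+1)m + D - a\big]\,g_i(m) + 2i(\zeta+1)\,a\sum_{k>m} g_i(k) \\
  &\quad + (i-1)(D-m+a)\,g_{i-1}(m) + 2(i-1)\,a\sum_{k>m} g_{i-1}(k), \qquad i>2 .
\end{align*}
% Sum over i >= 2 with G(m) = \sum_{i\ge 2} i\,g_i(m).
% Loss and fragmentation terms:
%   -[(\zeta+1)m + D - a]\,G(m) + 2(\zeta+1)\,a\sum_{n>m} G(n)
% Source terms from (i-1)-plets (shift the index i-1 -> i):
%   (D-m+a)\,G(m) + 2a\sum_{n>m} G(n)
% Adding both groups and the delta source:
\begin{equation*}
-(\zeta+2)\,m\,G(m) + 2a\,G(m) + 2(\zeta+2)\,a\sum_{n>m} G(n) + L\,\delta_c(D-m) = 0 .
\end{equation*}
```

In this sketch the $D$-dependent parts of the loss and source terms cancel in the sum, which is why $D$ enters eq. (4) only through $\zeta$ and the source.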

Now we can compare the results of the simulations with
the solutions of (4); the comparison is shown in Fig. 2.

Additional comparisons for different sets of parameters
are given in supplemental figures (see Supplemental
materials). Let us now compare solutions of the equation
presented in [5] with the simulations of the same
duplication-mutation dynamics. For that we used equation
(5) of the supplemental materials of [5]. Comparisons
are shown in Fig. 3. The solutions of [2] provide
good agreement for mutation rates sufficiently large
compared to the duplication rate $\lambda$ but fail to reproduce
the amplitude of the length distributions in other
regimes. In those regimes saturation is observed with respect to
the amplitude of the length distributions, which is reproduced
by the solutions of (4), as seen in Fig. 2 and supplemental
figures 1 and 2 [18].

FIG. 2: Curves represent stationary length distributions obtained
from simulations of the duplication-mutation dynamics described
in the previous section with a monoscale source for
various base substitution rates, and the corresponding analytic
solutions (orange) of (4). The chromosome length L = 10®;
source length D = 10 ; duplication rate $\lambda$ = 10”'^; for simulations
we always take $a = 1$. Length distributions for the
same dynamics were computed by mummer [14] with
the options -maxmatch -n -b -l 20. The results
were then averaged over 10^ realizations.

FIG. 3: Curves represent stationary length distributions obtained
from simulations of the duplication-mutation dynamics
with a monoscale source for various base substitution rates
$\mu$, and the corresponding analytic solutions (magenta curves) of
eq. (5) of [5]. All parameters for the simulations and the
equation are the same as for Fig. 2. The results of simulations
were averaged over 10^ realizations.

One can then easily understand the qualitative correspondence
of length distributions observed in [2] and
[3] for high mutation rates: the growth of the mutation rate
$\mu$ evidently affects $g_i(m)$ more strongly for larger $i$, as a larger
$i$ means more sequences in the set, which is therefore destroyed
faster by mutations. Thus the main contribution
to $G(m)$ for high mutation rates comes from $g_2(m)$,
i.e., $G(m) \sim g_2(m)$ as $\zeta \to \infty$, and the dynamics is
described by (2) in the main order. It is also instructive
to note that the situation $\mu \gg \lambda$ generally implies $\zeta \gg 1$,
and one can neglect in (4) all terms compared to those
containing $\zeta$ and the source term with the delta function.
To keep the algebraic tail, $L/a$ then has to grow as $\sim \zeta$
to keep the same order of the source term $L\,\delta_c(D-m)$;
otherwise the tail disappears, as is seen from Fig. 2
for large $\mu$: here $\zeta$ is growing but the length $L$ remains
fixed. However, this is not applicable even for $\zeta \sim 1$. On
the other hand, if $\mu \ll \lambda$ then $\zeta \to 0$ and we can write
down the equation corresponding to the limit of absent
mutations, as $\zeta$ becomes negligible compared to 1:


$$-2m\,G(m) + 2a\,G(m) + 4a \sum_{n>m} G(n) + L\,\delta_c(D-m) = 0. \qquad (5)$$

If $D$ is fixed as in Figs. 2 and 3, then the limiting amplitude
of the algebraic tail is controlled by the only parameter
$L$, and all distributions with decreasing $\mu$ asymptotically
approach the saturation line; this line establishes an upper
boundary for fitting the model to a natural sequence.
This can also be seen from the exact solution of (5), which
has the form

$$G(m) = \begin{cases}
\dfrac{aDL}{(m-a)\,m\,(m+a)}, & m < D, \\[2ex]
\dfrac{L}{2(D-a)}, & m = D,
\end{cases}$$

with the obvious main-order term $\sim 1/m^3$ for $a \ll m$. The
solution is applicable if $a \ll D \ll L$; otherwise finite-size
effects turn out to be strong.
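
Because the sum in eq. (4) runs only over $n > m$, the equation can be solved by a backward recursion from $m = D$ down to small $m$. The sketch below does this on the integer lattice $a = 1$ and checks the result against the exact solution of eq. (5) at $\zeta = 0$. The function names and the cut-off `m_min` are illustrative choices, and $G(m) = 0$ is assumed for $m > D$.

```python
def solve_G(L, D, zeta, m_min=2):
    """Solve eq. (4) with a = 1 for m = D, D-1, ..., m_min; returns {m: G(m)}.

    Rearranged eq. (4): G(m) = [2(zeta+2) * sum_{n>m} G(n) + L*delta(D-m)] / [(zeta+2)m - 2].
    m_min = 2 avoids the vanishing denominator at m = 1 when zeta = 0.
    """
    G = {}
    tail = 0.0                       # running value of sum_{n>m} G(n)
    for m in range(D, m_min - 1, -1):
        source = L if m == D else 0.0
        G[m] = (2 * (zeta + 2) * tail + source) / ((zeta + 2) * m - 2)
        tail += G[m]
    return G

def exact_G_no_mutation(L, D, m):
    """Exact solution of eq. (5) (zeta = 0) for a = 1."""
    if m == D:
        return L / (2 * (D - 1))
    return D * L / ((m - 1) * m * (m + 1))

G0 = solve_G(L=10**6, D=1000, zeta=0.0)
worst = max(abs(G0[m] / exact_G_no_mutation(10**6, 1000, m) - 1) for m in G0)
print(worst)   # relative deviation from the exact solution (rounding error only)
```

Calling `solve_G` with increasing `zeta` at fixed `L` lowers the amplitude of the tail, which is the behaviour discussed above for large mutation rates.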

The existence of saturation can also be seen from
the continuum limit of the dynamics under consideration.
Introducing the dimensionless variables

$$\bar a = \frac{a}{D}, \qquad \bar m = \frac{m}{D}, \qquad \bar L = \frac{L}{D},$$

so that $D$ corresponds to 1, we see that the dimensionless
size of the lattice $\bar a \ll 1$ and hence $\bar a \to 0$. We then
denote $\bar m = x$ and, taking into account that $L/D \gg 1$,
we also take $\bar L \to \infty$; other parameters may vary. Then
$\bar L\,\delta_c(1-x)$ turns into the Dirac delta and equation (4)
takes the form

$$-(\zeta+2)\,x\,G(x) + 2(\zeta+2)\int_x^{\infty} G(y)\,dy + \delta(1-x) = 0.$$

This equation corresponds to the stationary form of eq.
(1) of the earlier work. Its solution is


$$G(x) = \frac{1}{\zeta+2}\left[\frac{\delta(1-x)}{x} + \frac{2}{x^{3}}\right]. \qquad (6)$$


FIG. 4: The length distribution for the repeat-masked C. elegans
chromosome 2 was computed using mummer with the
options -maxmatch -n -b -l 20; self-hits were removed from
the distribution. The length of the chromosome is $\sim 10^7$.
The dotted curve represents the solution of eq. (4) for the
parameters computed for the natural chromosome: D = 2000,
$\mu$ = 2 × 10“^, and $\lambda$ = 2 × 10“^.

The function has the exponent $-3$ for all $x \in (0,1)$. It
is seen that the amplitude of the distribution $G(x)$ is
controlled by the parameter $1/(\zeta+2)$, while the slope remains
the same; but in the new variables $\zeta$ has the form $\mu/(\lambda \bar a)$,
and as $\bar a \to 0$ in the continuum limit the tail $-3$ vanishes
unless at least $\mu/\lambda \sim \bar a$. For small $\zeta$ the dependence
of the amplitude on the parameters $\mu$ and $\lambda$ disappears,
which corresponds to the observed saturation.
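
As a worked check, one can substitute a solution with amplitude $1/(\zeta+2)$ and exponent $-3$ into the continuum equation. The sketch below assumes that $G(x) = 0$ for $x > 1$ and that the solution has the form written in its first line; both are assumptions consistent with the amplitude and exponent described in the text.

```latex
% Assumed form of the solution on 0 < x <= 1, with G(x) = 0 for x > 1:
\begin{align*}
G(x) &= \frac{1}{\zeta+2}\left[\frac{\delta(1-x)}{x} + \frac{2}{x^{3}}\right], \\
-(\zeta+2)\,x\,G(x) &= -\delta(1-x) - \frac{2}{x^{2}}, \\
\int_x^{\infty} G(y)\,dy &= \frac{1}{\zeta+2}\left[1 + \int_x^{1}\frac{2\,dy}{y^{3}}\right]
   = \frac{1}{(\zeta+2)\,x^{2}}, \qquad 0 < x < 1, \\
-(\zeta+2)\,x\,G + 2(\zeta+2)\int_x^{\infty} G\,dy + \delta(1-x)
   &= -\delta(1-x) - \frac{2}{x^{2}} + \frac{2}{x^{2}} + \delta(1-x) = 0 .
\end{align*}
% Tail in the original variables, x = m/D and \zeta = D\mu/(a\lambda), for \zeta \gg 2:
\begin{equation*}
G \simeq \frac{2}{\zeta\,x^{3}} = \frac{2\,a\,\lambda\,D^{2}}{\mu\,m^{3}} .
\end{equation*}
```

For $\zeta \ll 2$ the same expression gives $G \simeq 1/x^{3}$, with no dependence on $\mu$ or $\lambda$, which is the saturation regime.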

COMPARISON TO NATURAL DATA 

For the comparison of our results with natural data
we take C. elegans chromosome 2, for which we show
the length distribution of exact matches in Fig. 4. As
all synthetic sequences when processed with mummer do
not contain "self-hits", i.e., identical sequences located
at exactly the same positions in both copies of the chromosome,
the self-hits were also removed from the mummer
output for the natural sequence. To estimate the
parameters of our model for this chromosome we use the
estimate of the duplication rate, 0.0208 per gene per
1 my (million years), so that $\approx 400$ duplications occur in genes
per 1 my [19] (the number of genes in the C. elegans
genome is estimated to be around $2 \times 10^4$ [20]), or $\beta_0 = 40$
per 1 my for chromosome 2 of length $\sim 10^7$ bases (the
length of the whole genome is taken to be $\sim 10^8$
bases). For the rate per base $\lambda_0$ we have $\beta_0/L_0$, where $L_0$
is the number of bases in the C. elegans chromosome 2 belonging to
genes. It is known that genes cover around 50% of the
whole genome in C. elegans, hence $L_0 \sim 5 \times 10^6$. We
assume that the duplication rate for non-coding parts of
the chromosome is the same, $\lambda = \lambda_0 \sim 10^{-5}$ per base per 1 my. Then
we find that $\lambda L = 100$ duplications occur in coding and
non-coding parts of C. elegans chromosome 2 per 1 my.
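
The chain of estimates in this paragraph is plain arithmetic and can be retraced as follows. The powers of ten used below ($2 \times 10^4$ genes, $10^8$ and $10^7$ bases) are assumptions fixed by requiring that the quoted results (400, 40, and 100) come out consistently; the variable names are illustrative.

```python
# Inputs quoted in the text; the powers of ten are assumptions (see above).
dup_rate_per_gene = 0.0208   # duplications per gene per 1 my
n_genes = 2e4                # assumed number of genes in the C. elegans genome
genome_length = 1e8          # assumed number of bases in the whole genome
chrom_length = 1e7           # assumed number of bases in chromosome 2
coding_fraction = 0.5        # genes cover about 50% of the genome

dups_in_genes = dup_rate_per_gene * n_genes      # about 416, quoted as ~400
beta0 = 400 * chrom_length / genome_length       # 40 per 1 my on chromosome 2
L0 = coding_fraction * chrom_length              # 5e6 coding bases on chromosome 2
lambda0 = beta0 / L0                             # 8e-6, i.e. of order 1e-5 per base per 1 my
total_dups = 1e-5 * chrom_length                 # lambda * L, about 100 per 1 my

print(dups_in_genes, beta0, L0, lambda0, total_dups)
```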

For the mutation rate in C. elegans we accept the estimate
$\approx$ 2 × 10“^ per base per 1 my [21]; one generation =
four days. To map the parameters of the natural chromosome
to the model we use the estimate for the algebraic
tail of the length distribution, $2D^2\lambda/(\mu m^3)$. This estimate
follows from the prefactor in (6) if we take into
account that $x = \bar m = m/D$ and $a = 1$. The amplitude
of the distribution for any specific $m$ is estimated
directly from the plot. In addition, it is necessary to take
into account that $\lambda = \lambda_{model}$ from (4) is related to the
duplication rate in the natural chromosome $\lambda_{nat} = 10^{-5}$
as $\lambda_{model} = D\lambda_{nat}$. From all previous estimates we
obtain $D \approx 2000$ and $\lambda_{model} = 2 \times 10^{-2}$. These estimates
yield the solution of eq. (4) shown in Fig. 4. The
exact matches of length > 200 observed in Fig. 4
imply that a realistic source of duplications should
have non-zero variance, unlike the delta source studied
here. However, as was shown in [1], such a source does
not influence the form of the tail of the length distribution.

DISCUSSION 

The solutions of the duplication-mutation dynamics
presented in the paper raise a number of questions.
To explain the heavy algebraic tails observed in
length distributions of natural sequences we used the solutions
of the equations for $t \to \infty$. In connection with
biology this should not be understood as a claim
that natural sequences are in fact in a stationary state.
First, the models studied here include only two processes
having some analogies with processes in natural DNA.
Therefore it would not be correct to interpret them as
models of how natural sequences have de facto been varying
in their history. For example, in [1] we demonstrated
that long-range correlations detected in natural
DNA were not found in the synthetic sequences obtained
by means of these models; i.e., the length distributions
merely reflect some important evolutionary features of
natural DNA while neglecting other features. Second, it is
necessary to stress that the basic assumptions of the model
imply uniform mutation and duplication rates both in time
and in space, while in natural genomes these quantities
may vary depending, e.g., on the function of a DNA region.
Nevertheless, the correspondence between the solutions of
the model and natural data demonstrates that the equations
capture essential details of the data. On the other
hand, it is hardly possible to indicate a characteristic
time scale for all eukaryotic sequences on which significant
evolutionary changes occurred to form the modern
genomes. Therefore, as the time for natural sequences is
restricted by the present moment, we do not have sufficient
evidence to map this time moment to a specific time
moment of the model, and the most plausible assumption
is to map it to the stationary state of the model attained
for $t \to \infty$ (in the units of the model). This assumption
is supported by the observation that stationary length
distributions of the model reproduce the length distributions
of natural sequences. However, this should be
understood as a sojourn of a non-stationary solution
in the neighbourhood of the stationary one for a time that is
long compared to a characteristic time scale
of the system, rather than as a "fixation" of natural genomes
in stationary states; thus the stationary system approximates
the natural DNA well while the latter may still
remain non-stationary. Obviously, if a natural chromosome
demonstrates noticeable deviations from the algebraic
tail or other deviations from the stationary solution,
the assumption of non-stationarity becomes possible and
has to be studied separately.

The equations have several features that deserve to
be stressed. First of all, the equations we derived for
$G(m)$ allow the length distributions of exact matches
computed by mummer to be reproduced correctly in a broad
range of parameters. That means, in particular, that
histograms computed by counting pairs of maximal exact
matches with mummer can be understood as $\sum_i i\,g_i(m)$,
i.e., they represent a cumulative sum over all sequences
of duplicates, triplicates, etc. It is worth noting that
the mummer output does not provide the functions $g_i(m)$
directly, and thus the question of interpreting $g_i$
in terms of biologically meaningful sequences remains
open: we observe only some cumulative effect of the
distributions $g_i(m)$. On the other hand, the correspondence
of the functions $g_2(m)$ to the length distribution of
supermaxmers indicates a potential way to resolve this issue:
if the functions $g_2(m)$ are interpreted as supermaxmers,
then the candidates for $g_3(m)$, $g_4(m)$, etc. could be
so-called 'local maxmers' [12, 22]. At the same time, the
observed correspondence between the mummer output and the function
$G(m)$ suggests that we have an analytic interpretation for the
length distributions computed by mummer for natural
sequences: the length distributions for natural sequences
exhibiting algebraic behaviour with the exponent $-3$ can
be understood in terms of equations (3) and (4) and their
solutions.

The representation $G(m) = \sum_i i\,g_i(m)$ indicates
that the function $G(m)$ for each $m$ can be thought of as
the average number of sequences $i$ if $g_i(m)$ is understood as a
non-normalized distribution function, over $i$, of the number of
sequences per exact match. Equation (4)
has the form of a fragmentation equation with an input
and thus can be construed as a stationary fragmentation
equation for these average quantities $G(m)$.

We also proposed a hierarchy of equations for $g_i$; the
first of these equations, i.e., for $g_2$, was derived in [3], and
we see that the equations of [2] and [3] as well as those
presented here treat different subjects, focusing on various
restrictions imposed on exact matches. In particular, the
work in [3] deals with the collection of 'supermaxmers',
specific pairs of exact repeats computed with additional
conditions of maximality, which are discussed in [12] (see
Appendix 1); they are important because the equations for
them not only account for the observed algebraic behaviour
in length distributions of natural DNA sequences
but also demonstrate, in part, the non-algebraic length distributions
observed both in simulations and in natural DNA,
and also because their definition provides them with a
natural biological interpretation. They are accounted
for by equation (2) and demonstrate an obvious discrepancy
from the length distribution of exact matches (suppl. fig.
3). Our equation (4) treats all pairs of exact matches
regardless of their uniqueness and reproduces their length
distributions. Then $G(m)$ in our interpretation may be
represented as a sum of 'supermaxmers', for which the
biological interpretation was already discussed, and other
sets of sequences obtained by a natural extension of the
concept of supermaxmers; in this sense, we expect that
such an interpretation of $g_i(m)$, $i > 2$, will appear soon.

The author is grateful to Kun Gao for helpful
discussions.


APPENDIX 1. TO THE DEFINITION OF EXACT
MATCHES

In the appendix we provide more rigorous definitions of
the maximal repeats or matches that were used in the paper
and that allow the results presented here to be distinguished
from those obtained earlier. There may be several
approaches to the definition of exact matches and supermaximal
repeats (cf. [12]); our approach construes the
sequence as a set, and thus all definitions are given in
terms of sets and subsets.

§1. Consider a finite sequence of objects $x_i$, $i = 1, 2, \ldots, L$,
$L < \infty$. For each element of the sequence there is a
pair $\{i, x_i\}$, where $i$ is the number of the element in the
sequence¹; hence, we have a set of pairs $\{\{i, x_i\}\}$. We
denote this set by $X$. By $X_k$ we denote a subset of $X$
consisting of $k$ pairs $\{i, x_i\}$ corresponding to $k$ consecutive
elements of the sequence. In the case of DNA sequences
the sequence of length $L$ corresponds to a whole
chromosome, a whole genome, or any long DNA
sequence.

§2. The configuration space is defined by the possible values
of $x_i$. In the general situation we can assume that this
space $S$ is the same for all sites of the sequence and
$S = \{0, 1, 2, \ldots, A-1\}$. Thus, we have $A^L$ possible
states of the system. Consider also the set $Y$ of all
arbitrary $A$-ary sequences containing $l$ elements, $1 \le l \le L$.
This is a finite set with the cardinal number
$|Y| = A(A^L - 1)/(A - 1)$. Elements of
this set will be denoted by $y_k$, where the index $k$ indicates the
number of elements in the corresponding sequence. The
elements of $y_k$ are denoted $y_k = (y_k^1, \ldots, y_k^k)$. For
DNA sequences the configuration space has the form
$S = \{A, C, G, T\}$.

¹ We use this redundant notation only for clarity. It is clear that
the notation $\{x_i\}$ is enough to denote the set of pairs; thus below
$y_k$ may also denote the set of $k$ pairs $\{j, y_k^j\}$.

Example. Let the configuration space be binary, i.e.,
$S = \{0, 1\}$. Consider the sequence X = {10101010}, for
which $L = 8$. The set $X$ is represented as follows:

X = {{1,1}, {2,0}, {3,1}, {4,0}, {5,1}, {6,0}, {7,1}, {8,0}}.

For this set one of the $X_3$'s is given by
{{2,0}, {3,1}, {4,0}}. The set $Y$ consists of all binary
sequences containing $l$ elements, $1 \le l \le 8$. An
example of an arbitrary $y_4$ is furnished by an arbitrary
binary sequence of 4 elements.

§3. We say that the element $y_k \in Y$ intersects with
the sequence $X$ if $\exists\, \alpha$: $1 \le \alpha \le L-k+1$ such that
$y_k^j = x_{\alpha+j-1}$, $j = 1, 2, \ldots, k$. In our example the
element $y_4$ = {1010} intersects with $X$ three times.
The subsets of $X$ corresponding to these intersections
are given by $X_4^1$ = {{1,1}, {2,0}, {3,1}, {4,0}},
$X_4^2$ = {{3,1}, {4,0}, {5,1}, {6,0}}, $X_4^3$ =
{{5,1}, {6,0}, {7,1}, {8,0}}.
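
The notion of intersection in §3 amounts to listing all, possibly overlapping, occurrences of a short sequence in a long one. A minimal sketch follows; the function name `intersections` is illustrative, and positions are 1-based to match the numbering of pairs used above.

```python
def intersections(x, y):
    """1-based start positions alpha at which y matches x[alpha], ..., x[alpha+k-1]."""
    k = len(y)
    # alpha runs over 1, ..., L-k+1; overlapping occurrences are all counted
    return [a + 1 for a in range(len(x) - k + 1) if x[a:a + k] == y]

print(intersections("10101010", "1010"))   # [1, 3, 5]
```

The three start positions 1, 3 and 5 correspond to the three subsets listed in the example.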

Let the element $y_k \in Y$ intersect with $X$, and let the
intersection be given by the set $\{X_k^1, X_k^2, \ldots, X_k^h\}$. We denote
this by $y_k = \{X_k^1, X_k^2, \ldots, X_k^h\}$, where $X_k^j \subset X$, $\forall j$.

Definition 1. The element $y_k = \{X_k^1, X_k^2, \ldots, X_k^h\} \in Y$
is referred to as a sub-maximal k-mer if $h > 1$.

Definition 1'. Each pair of sets $(X_k^i, X_k^j)$, $i \ne j$, of $y_k$
is referred to as an exact match.

Definition 2. An exact match $(X_k^i, X_k^j)$, $i \ne j$, is referred
to as a maximal exact match if at least one of
$X_k^i, X_k^j \not\subset X_{k+p}^s$, $\forall p \ge 1$ and $\forall s$ such that $X_{k+p}^s \in y_{k+p} =
\{X_{k+p}^1, \ldots, X_{k+p}^{h'}\}$, where $y_{k+p}$ is a sub-maximal $(k+p)$-mer.

Example.

Consider the sequence

TGGT GGTTA ATTCACA GGTTA CA GGTTA GGG

Its subsequence GGTTA is a sub-maximal 5-mer with
$h = 3$. Each pair of its three occurrences forms an exact
match. On the other hand, a maximal exact match is
formed by any pair except the one containing occurrences
2 and 3, as both of these are immersed
in the longer sub-maximal 8-mer ACAGGTTA. In other
words, maximal exact matches cannot be extended even by
one symbol to the left or to the right and still remain
exact matches.
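
The example can be checked mechanically. The sketch below uses the extension criterion stated in the last sentence (a pair is maximal if it cannot be extended by one symbol to the left or to the right) rather than Definition 2 verbatim; the function names are illustrative.

```python
from itertools import combinations

def occurrences(seq, kmer):
    """0-based start positions of every (possibly overlapping) occurrence of kmer."""
    k = len(kmer)
    return [s for s in range(len(seq) - k + 1) if seq[s:s + k] == kmer]

def is_maximal(seq, s1, s2, k):
    """True if the length-k match starting at s1 and s2 cannot be extended by one symbol."""
    left = s1 > 0 and s2 > 0 and seq[s1 - 1] == seq[s2 - 1]
    right = s1 + k < len(seq) and s2 + k < len(seq) and seq[s1 + k] == seq[s2 + k]
    return not (left or right)

def maximal_pairs(seq, kmer):
    """Pairs of occurrence numbers (1-based, in order of position) that are maximal."""
    occ = occurrences(seq, kmer)
    return [(i + 1, j + 1)
            for (i, s1), (j, s2) in combinations(enumerate(occ), 2)
            if is_maximal(seq, s1, s2, len(kmer))]

seq = "TGGTGGTTAATTCACAGGTTACAGGTTAGGG"   # the example sequence without spaces
print(occurrences(seq, "GGTTA"))          # [4, 16, 23]
print(maximal_pairs(seq, "GGTTA"))        # [(1, 2), (1, 3)]
```

Occurrences 2 and 3 are both preceded by the same symbols, so their pair can be extended to the left and is excluded, in agreement with the example.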

§4. For further purposes we should notice that a sub-maximal
$k$-mer can be contained in another sub-maximal
$(k+p)$-mer, $p > 0$, in the sense that it may occur that
$\forall X_k^j$ there exists $X_{k+p}^s$: $X_k^j \subset X_{k+p}^s$. This observation
motivates the following definition.

Definition 3. The sub-maximal $k$-mer $y_k =
\{X_k^1, X_k^2, \ldots, X_k^h\} \in Y$ is referred to as a local maximal
k-mer if for any sub-maximal maxmer $y_{k+p} =
\{X_{k+p}^1, \ldots, X_{k+p}^{h'}\}$, where $p \ge 1$, $\exists X_k^i \in y_k$ such that
$X_k^i \not\subset X_{k+p}^j \in y_{k+p}$, $j = 1, \ldots, h'$.

Definition 4. A local maximal k-mer is referred to as
a supermaximal k-mer if the conditions of Definition 3
are valid for all $X_k^i \in y_k$.

In the example above the subsequence ACAGGTTA
represents a supermaximal 8-mer, while the three sequences
GGTTA yield a local maxmer, as only the first such sequence
cannot be extended, while the two other sequences
can be extended to the supermaximal maxmer ACAGGTTA.

It is seen that the relations between maximal exact matches
and supermaximal and local maximal maxmers are not
straightforward. One may roughly say that the set of
all supermaximal repeats is a subset of all maximal
exact matches. However, insignificant deviations
from this inclusion can appear because we define maximal
exact matches as pairs of elements, while supermaxmers,
even for DNA sequences, can consist of three sequences;
but such supermaxmers are so rare that their influence
is negligible, and in a zeroth approximation we can rely
on the relation indicated above. The connections to local
maxmers are more subtle: from the example above
it is clear that maximal exact matches are often "chosen"
as pairs from local maxmers containing many sequences.
Though it is correct that supermaximal and
local maxmers suggest a more non-trivial division of repeats
in the chromosome, maximal exact matches as we
defined them above provide an independent measure of
non-local correlations in DNA.


APPENDIX 2. TO THE DEFINITION OF 
LENGTH DISTRIBUTION. 

§5. Based on the previous definitions of various repeats,
we provide a more rigorous treatment of the length
distribution.

Definition 5. The number of $X_k^i$ contained in a
sub-maximal k-mer is referred to as the index of the
sub-maximal k-mer with respect to the set $X$ and is
denoted by $\mathrm{Inx}(y_k)$.

Thus $\mathrm{Inx}(y_k) = h$ (cf. definition 1). This
obviously corresponds to introducing an indicator
function on the set $Y$. According to definition 1,
$\min_{y_k \in Y} \mathrm{Inx}(y_k) = 2$. In addition, the
function $\mathrm{Inx}(y_k)$ is non-negative and
finite-valued. If the element $y_k$ is not a sub-maximal
k-mer, then we put $\mathrm{Inx}(y_k) = 0$. The index is
defined similarly for all types of repeats introduced
in §§3,4.

^1 There may exist sensible definitions of index different
from definition 5, from which we mention the following: if
$y_k$ is a sub-maximal k-mer from def. 1 with $h > 1$, then
$\mathrm{Inx}(y_k) = 1$ for any $h$. One may say that in
definition 5 the index counts 'occurrences' of a sequence
in $X$, while in the last definition the number of
sub-maximal k-mers is counted; this terminology is
developed in m
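
As a concrete reading of definition 5, the index can be computed by counting occurrences. The sketch below assumes, as the minimum value 2 above suggests, that a k-mer is sub-maximal when it occurs at least twice in the sequence; definition 1 itself is not reproduced here, so this threshold and the function name `index` are assumptions made for illustration.

```python
def index(seq, word):
    """Inx of word with respect to seq: the number of (possibly overlapping)
    occurrences if word is repeated, and 0 otherwise."""
    h = sum(seq.startswith(word, i) for i in range(len(seq) - len(word) + 1))
    return h if h >= 2 else 0


# Example sequence from Appendix 1, with the grouping spaces removed.
SEQ = "TGGTGGTTAATTCACAGGTTACAGGTTAGGG"
inx_repeat = index(SEQ, "GGTTA")   # a repeated 5-mer
inx_unique = index(SEQ, "ATTC")    # a 4-mer that occurs only once
```

With this convention a word that occurs only once receives index 0, which matches the rule stated above for elements that are not sub-maximal k-mers.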

§6. Let us introduce an equivalence relation on $Y$. Two
elements of $Y$ are equivalent if they are both sub-maximal
k-mers wrt. $X$. Thus, the set $Y$ is partitioned into
classes of equivalent elements. The set obtained by
factorization of $Y$ with respect to this equivalence
relation is denoted by $Y_F$. Thus, each element
$y_k^F \in Y_F$ consists of all sequences $y \in Y$ of k
elements intersecting with $X$ and included in some
(sub)maximal k-mer.

The notion of index is easily redefined for arbitrary
equivalence classes (not only for sub-maximal k-mers but
also for maximal exact matches or supermaxmers). These
definitions are straightforward and we omit them.

Definition 5'. If $y^1, y^2, \ldots, y^Q \in Y$ are equivalent
with respect to the equivalence relation $F$, then the index
of the corresponding element $y^F \in Y_F$ is given by

$$\mathrm{In}(y^F) = \sum_{i=1}^{Q} \mathrm{Inx}(y^i). \qquad (7)$$

§7. Example. We can consider the notion of index in
application specifically to supermaxmers. In this case the
configuration space is $S = \{A, T, C, G\}$ and supermaximal
maxmers can contain 2, 3 or 4 sequences^2. Thus,
according to definition 5, the corresponding indexes are
equal to 2, 3 and 4. The space $Y_F$ is obtained by
establishing the equivalence of all supermaxmers which have
the same number of elements.

The complete number of elements contained in
$y^F \in Y_F$ is given by $Q$. As each $y \in Y$ belongs to
at least one $y^F$, $Y$ is partitioned into equivalence
classes with respect to supermaximal sequences.
Consequently $\mathrm{In}(y^F)$ can be computed for any
$y^F$. Then we can introduce the following definition.

Definition 6. The function $n(k) = \mathrm{In}(y_k^F)$,
$y_k^F \in Y_F$, $k = 1, 2, \ldots$, is referred to as the
empirical length distribution on $Y$ wrt. $X$.
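
Definitions 5' and 6 can be combined into a short computation of an empirical length distribution. This is a minimal sketch under the simplest assumption, namely that every k-mer occurring at least twice counts as a sub-maximal k-mer and that the index of a class is the sum of the indices of its members. It is not the mummer-based procedure referred to in the figure captions, and `length_distribution` is a name chosen here for illustration.

```python
from collections import Counter


def length_distribution(seq, k_max):
    """Return {k: n(k)} for k = 1..k_max, where n(k) sums the occurrence
    counts of all k-mers that appear at least twice in seq."""
    dist = {}
    for k in range(1, k_max + 1):
        counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
        dist[k] = sum(h for h in counts.values() if h >= 2)
    return dist


# Example sequence from Appendix 1, with the grouping spaces removed.
SEQ = "TGGTGGTTAATTCACAGGTTACAGGTTAGGG"
n = length_distribution(SEQ, 9)
```

Restricting the sum to supermaximal or to maximal exact matches, as discussed in §§3,4, would change only the filter applied to the counted k-mers.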

^2 In the binary case, only two sequences. The number of
supermaxmers with 3 or 4 sequences is negligible compared
to those with two sequences.

§8. It is important to notice that the equivalence relation
is constructed for studying correlation properties of
m-ary sequences, e.g., genomes, which do not depend on
the concrete structure or content of these sequences but
which incorporate physical length as one of the
governing parameters. In this context it should be
understood that there are many other ways to construct an
equivalence relation or, in physical terms, a coarse
graining on $Y$. However, these definitions typically
neglect the physical length. The simplest way is to
include only supermaximal k-mers and neglect local ones.
To give a less obvious and more exotic example, we may say
that two elements of $Y$ are equivalent if, provided that
the configuration space is $S = \{0, 1\}$, they contain
equal fractions of 1s. This is especially easy to envisage
for binary sequences but may also be reasonable for
arbitrary m-ary sequences. In part, a similar construction
was applied in [23] to produce so-called k spectra of
genomes. As the genetic 'alphabet' consists of 4 letters,
the authors consider k-mers with respect to the fraction
of (A+T) content. In our terms that means introducing a
different equivalence relation on the set $Y$ than the one
mentioned above. On the other hand, we may consider the
trivial equivalence relation in which any $y \in Y$ is
equivalent only to itself. This situation is ubiquitously
exploited, e.g., in genomics, where one can take a specific
"functional" sequence and ask whether its copies are found
in different genomes. In this situation the content of the
sequence is not eliminated because the assumed
functionality implies that any nucleotide may be
important. An interesting example of manipulations with
this limiting case of self-equivalency is given in [23].
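
The alternative equivalence relations mentioned in §8 differ only in the key by which elements of Y are grouped. As a hedged illustration, and not the construction of [23] itself, the sketch below partitions the distinct k-mers of a sequence by their (A+T) fraction; the function name `classes_by_at_fraction` is hypothetical.

```python
from collections import defaultdict
from fractions import Fraction


def classes_by_at_fraction(seq, k):
    """Group the distinct k-mers of seq into equivalence classes
    labelled by the fraction of A and T symbols they contain."""
    classes = defaultdict(set)
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        at = sum(c in "AT" for c in word)   # number of A or T symbols
        classes[Fraction(at, k)].add(word)
    return dict(classes)


# Example sequence from Appendix 1, with the grouping spaces removed.
classes = classes_by_at_fraction("TGGTGGTTAATTCACAGGTTACAGGTTAGGG", 2)
```

Using the word itself as the key instead of its (A+T) fraction gives the trivial relation in which every element is equivalent only to itself.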
SUPPLEMENTAL FIGURES

FIG. S1: Comparisons of simulations with solutions of equation (4) of the main text. The parameters are: L = 10®, D = 10®,
A = 10“®. Empirical length distributions were computed with the same switches of mummer as indicated in the caption for
figure 1 of the main text. The distributions were averaged over 100 realizations.

FIG. S2: Parameters of the model are: L = 10®, D = 10^, A = 10 ^. All other parameters and options are the same as in
figure 1 of the main text and supplemental figure 1.

FIG. S3: Length distributions obtained with duplication-mutation dynamics using mummer with the parameters -n -b -l 20.
Parameters of the model are: L = 10®, D = 10®, A = 10“^ and correspond to those indicated in fig. 1 of the main text.
Magenta curves represent the solutions of equation (2) of the main text.












